Nonlinear Solver

Generalities

Solvers for nonlinear estimation problems come in several flavors. We distinguish two major families:

  • Filters

  • Optimizers

In WOLF we are mostly concerned with optimizers, as they perform better. We have devoted a full section to them.

We highlight below the basic characteristics of filters and optimizers.

Filters

Filters, such as the Extended Kalman filter, are based on Bayesian filtering and proceed in two stages:

  1. The prediction stage takes into account the passage of time:

    \[x_t = f ( x_{t-1}, u_t, n_t )\]

    where \(x\) is the state, \(u\) is the control signal, and \(n\) is a random perturbation (typically of zero mean and Gaussian distribution). The function \(f(\cdot)\) is the prediction function, and is nonlinear.

    In SLAM, motion is modelled with a prediction function, because it involves the passage of time. We therefore call \(f(\cdot)\) the motion model.

    The prediction step defines two equations: one for the estimate mean, and one for its covariance,

    \[\begin{split}\begin{align} \hat x_t &= f ( \hat x_{t-1}, u_t, 0 ) \\ P_t &= F_x\,P_{t-1}\,F_x^\top + F_n\,N\,F_n^\top \end{align}\end{split}\]

    with the Jacobians \(F_x = \frac{\partial f}{\partial x}\) and \(F_n=\frac{\partial f}{\partial n}\), and the perturbation covariance \(N\).

  2. The correction stage incorporates the information of the sensors:

    \[y = h(x) + v\]

    where \(v\) is a measurement noise, typically of zero mean and Gaussian distribution. The function \(h(\cdot)\) is known as the observation model, and is also nonlinear.

    The correction step is a little longer. First, we compute the innovation and its covariance

    \[\begin{split}\begin{align} z &= y - h(\hat x) \\ Z &= H\,P\,H^\top + R \end{align}\end{split}\]

    with the Jacobian \(H=\frac{\partial h}{\partial x}\) and the noise covariance \(R\); then the Kalman gain

    \[K = P\,H^\top\,Z^{-1}\]

    and finally update mean and covariance,

    \[\begin{split}\begin{align} \hat x &\gets \hat x + K\,z \\ P &\gets P - K\,Z\,K^\top \end{align}\end{split}\]
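
The two stages above can be sketched numerically. The following is a minimal EKF sketch for an assumed 1D constant-velocity model whose position is measured directly; the models \(f(\cdot)\) and \(h(\cdot)\), their Jacobians, and all numeric values are illustrative assumptions, not part of WOLF's API.

```python
import numpy as np

dt = 0.1  # assumed time step

def f(x, u):
    # motion model: position integrates velocity; control u accelerates
    p, v = x
    return np.array([p + v * dt, v + u * dt])

F_x = np.array([[1.0, dt], [0.0, 1.0]])   # df/dx
F_n = np.array([[0.0], [dt]])             # df/dn (perturbation enters velocity)

def h(x):
    # observation model: we measure position only
    return np.array([x[0]])

H = np.array([[1.0, 0.0]])                # dh/dx

def predict(x, P, u, N):
    # prediction stage: propagate mean and covariance
    x = f(x, u)
    P = F_x @ P @ F_x.T + F_n @ N @ F_n.T
    return x, P

def correct(x, P, y, R):
    # correction stage: innovation, Kalman gain, update
    z = y - h(x)                          # innovation
    Z = H @ P @ H.T + R                   # innovation covariance
    K = P @ H.T @ np.linalg.inv(Z)        # Kalman gain
    x = x + K @ z
    P = P - K @ Z @ K.T
    return x, P

x = np.array([0.0, 1.0])
P = np.eye(2)
N = np.array([[0.5]])                     # perturbation covariance
R = np.array([[0.1]])                     # measurement noise covariance
x, P = predict(x, P, u=0.0, N=N)
x, P = correct(x, P, y=np.array([0.12]), R=R)
```

Note that the two stages can be called at different rates: `predict` at every control or time step, `correct` only when a measurement arrives.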

Optimizers

Optimizers, such as those based on factor graphs, solve the whole problem at once by considering the states as nodes in a graph and the factors linking these states:

  1. Motion is transformed into a factor by expressing the prediction error \(e\), for example, using the same motion model \(f(\cdot)\),

    \[e_t = \hat x_t - f(\hat x_{t-1}, u_t)\]

    Clearly, this error uses the same motion model \(f(\cdot)\) as in the filtering case. Since it is a prediction, the noise \(n\) is zero, and we omit it from \(f(\cdot)\).

    Alternatively, we can use an implicit model \(g(\cdot)\) so that the error can be expressed as a function of all the involved states plus the control,

    \[e_t = g(\hat x_{t-1}, \hat x_t, u_t)\]

    This error is weighted by its uncertainty to build the residual \(r_t\),

    \[r_t = \Omega_t^{\top/2} e_t\]

    where \(\Omega_t = N^{-1}\) is the information matrix of the factor \(t\), computed as the inverse of the covariance matrix \(N\) of the perturbation \(n\).

  2. Exteroceptive observations are treated in the same way, that is, by computing the error between the measurement \(y\) and its prediction \(h(\hat x)\),

    \[e_k = y_k - h(\hat x)\]

    Again, this error uses the same observation model \(h(\cdot)\) as in the filtering case. We also omit the noise \(v\) from the observation model, since we are dealing with predictions.

    The residual is built as before,

    \[r_k = \Omega_k^{\top/2} e_k\]

    where we have \(\Omega_k = R^{-1}\), with \(R\) the covariance of the measurement noise \(v\).
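
The residual construction above admits a quick numeric check: the square-root information matrix \(\Omega^{\top/2}\) can be obtained from a Cholesky factorization \(\Omega = L L^\top\), so that \(r = L^\top e\) satisfies \(||r||^2 = e^\top \Omega\, e\). The covariance and error values below are made up for illustration.

```python
import numpy as np

R = np.array([[0.04, 0.01],
              [0.01, 0.09]])        # measurement noise covariance (illustrative)
Omega = np.linalg.inv(R)            # information matrix, Omega = R^{-1}
L = np.linalg.cholesky(Omega)       # lower-triangular factor, Omega = L @ L.T
e = np.array([0.3, -0.2])           # error e = y - h(x_hat) (illustrative)
r = L.T @ e                         # whitened residual r = Omega^{T/2} e

# the squared residual equals the Mahalanobis norm of the error
assert np.isclose(r @ r, e @ Omega @ e)
```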

All factors, regardless of their type, are collected in a single cost function which is minimized to solve the least-squares estimation problem,

\[\hat x^* = \textrm{argmin}_{\hat x} \sum_k || r_k(\hat x) ||^2\]

The optimization problem can be solved iteratively by linearizing the residuals around a current estimate \(\hat x\),

\[r_k(\hat x+dx) \approx r_k(\hat x) + J_k dx\]

with \(J_k=\frac{\partial r_k}{\partial x}\). This yields

\[\begin{split}\begin{align} dx^* &= \textrm{argmin}_{dx} \sum_k ||{r_k(\hat x+dx)}||^2 = \textrm{argmin}_{dx} \sum_k ||r_k(\hat x) + J_k dx||^2 \\ \hat x &\gets \hat x + dx^* \end{align}\end{split}\]

The two steps above are iterated until convergence.

To solve the first step, the problem can be manipulated as follows. First, stack all residuals in a large vector, and all Jacobians in a large matrix,

\[\begin{split}\begin{align} r &= \begin{bmatrix}r_1 \\ \vdots \\ r_K \end{bmatrix} & J &= \begin{bmatrix}J_1 \\ \vdots \\ J_K \end{bmatrix} \end{align}\end{split}\]

Since \(||r(\hat x) + J dx||^2 = \sum_k ||r_k(\hat x) + J_k dx||^2\), the problem now reduces to

\[\begin{align} dx^* &= \textrm{argmin}_{dx} ||r(\hat x) + J dx||^2 \end{align}\]

which can be solved using the pseudo-inverse of \(J\) if the problem is small,

\[dx^* = - (J^\top J)^{-1} J^\top r\]

or using some matrix factorization (QR or Cholesky being the most popular ones) if the problem is large.
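
The whole iteration can be sketched on a toy problem. Below, a hypothetical two-parameter exponential model \(y = a\,e^{bt}\) is fit to a few samples by repeating the linearize-and-solve step, using the normal-equations form of the pseudo-inverse solution; the data, initial guess, and iteration limits are illustrative, not representative of WOLF's solver.

```python
import numpy as np

# illustrative samples, roughly following y = 2 * exp(0.3 * t)
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.7, 3.6, 4.9])

def residual(x):
    # stacked residuals r(x) for all samples
    a, b = x
    return a * np.exp(b * t) - y

def jacobian(x):
    # stacked Jacobian J = dr/dx, one row per sample
    a, b = x
    e = np.exp(b * t)
    return np.column_stack([e, a * t * e])   # columns: dr/da, dr/db

x = np.array([1.5, 0.2])                     # initial estimate
for _ in range(20):
    r = residual(x)
    J = jacobian(x)
    # dx* = -(J^T J)^{-1} J^T r, solved via the normal equations
    dx = -np.linalg.solve(J.T @ J, J.T @ r)
    x = x + dx
    if np.linalg.norm(dx) < 1e-9:            # iterate until convergence
        break
```

For large sparse problems, the `np.linalg.solve` call would be replaced by a sparse QR or Cholesky factorization, as noted above.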